{ "cells": [ { "cell_type": "markdown", "id": "a6a5bbde-aaba-4423-b7e7-4cf107dd2321", "metadata": {}, "source": [ "# Persistent/Distributed Generation with ``Crop`` Example\n", "\n", "This example shows how to use the {class}`~xyzpy.Crop` object for disk-based\n", "combo running, either for persistent progress or distributed processing.\n", "\n", "First let's define a very simple function, describe it with a\n", "{class}`~xyzpy.Runner` and {class}`~xyzpy.Harvester`, and set the combos for\n", "this first set of runs." ] }, { "cell_type": "code", "execution_count": 1, "id": "0ff96a34-1fb4-4dba-8e5c-ac2926967b5b", "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_formats = ['svg']\n", "import xyzpy as xyz\n", "\n", "\n", "def foo(a, b):\n", "    return a + b, a - b\n", "\n", "\n", "r = xyz.Runner(foo, [\"sum\", \"diff\"])\n", "h = xyz.Harvester(r, data_name=\"foo_data.h5\")\n", "\n", "combos = {\n", "    \"a\": range(0, 10),\n", "    \"b\": range(0, 10),\n", "}" ] }, { "cell_type": "markdown", "id": "29d8beb9-ca35-4460-9242-12dcb52fd4fc", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "We could use the harvester to generate data locally. But if we want results to\n", "be written to disk, either for persistence or to run them elsewhere, we need\n", "to create a {class}`~xyzpy.Crop`." 
] }, { "cell_type": "code", "execution_count": 2, "id": "82a1f96d-42c5-4699-973f-3ca764f81276", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = h.Crop(name=\"first_run\", batchsize=5)\n", "c" ] }, { "cell_type": "markdown", "id": "4e48bae2-822d-450f-b2a4-333d14260a09", "metadata": {}, "source": [ "## Sow the combos\n", "\n", "A single crop is used for each set of runs/combos, with ``batchsize`` setting how many runs should be lumped together (default: 1).\n", "We first **sow** the ``combos`` to disk using the ``Crop``:" ] }, { "cell_type": "code", "execution_count": 3, "id": "75eed5ea-11e9-47d5-a3ad-1534a3327b9e", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Sow: 100%|##########| 100/100 [00:00<00:00, 56148.65it/s]\n" ] } ], "source": [ "c.sow_combos(combos)" ] }, { "cell_type": "markdown", "id": "e440d407-a885-4404-a4c1-a0bba1f1bef6", "metadata": {}, "source": [ "There is now a hidden directory containing everything the crop needs:" ] }, { "cell_type": "code", "execution_count": 4, "id": "44e4bd1c-6013-436e-a125-ce12a23a6d13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m.\u001b[m\u001b[m/ crop example.ipynb\n", "\u001b[34m..\u001b[m\u001b[m/ dask distributed example.ipynb\n", "\u001b[34m.xyz-first_run\u001b[m\u001b[m/ farming example.ipynb\n", "basic output example.ipynb visualize linear algebra.ipynb\n", "complex output example.ipynb\n" ] } ], "source": [ "!ls -a" ] }, { "cell_type": "markdown", "id": "51fb68d2-cc89-419f-9db4-1c0b52aeab7d", "metadata": {}, "source": [ "And inside that are folders for the batches and results, the pickled function, and some other dumped settings:" ] }, { "cell_type": "code", "execution_count": 5, "id": "b4b60af0-5cd2-483a-a61f-45c0bddc2c5d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"\u001b[34mbatches\u001b[m\u001b[m/ \u001b[34mresults\u001b[m\u001b[m/ xyz-function.clpkl xyz-settings.jbdmp\n" ] } ], "source": [ "!ls .xyz-first_run/" ] }, { "cell_type": "markdown", "id": "49704973-0453-440d-b29a-4983f65f1737", "metadata": {}, "source": [ "Once sown, we can check the progress of the ``Crop``:" ] }, { "cell_type": "code", "execution_count": 6, "id": "472a02a0-a40d-470a-99db-b5a86359a588", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "/Users/johnnie/Sync/dev/python/xyzpy/docs/examples/.xyz-first_run\n", "--------------------------------------------------------=========\n", "0 / 20 batches of size 5 completed\n", "[ ] : 0.0%\n", "\n" ] } ], "source": [ "print(c)" ] }, { "cell_type": "markdown", "id": "c41e4778-00bc-46f8-b7e2-119cef2cdb50", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ "There are a hundred combinations, with a batchsize of 5, yielding 20 batches to be processed.\n", "\n", ":::{hint}\n", "As well as ``combos`` you can supply ``cases`` and ``constants`` to\n", "{meth}`~xyzpy.Crop.sow_combos`.\n", ":::" ] }, { "cell_type": "markdown", "id": "2617dc86-fbd6-4f53-874c-f00a489cddec", "metadata": {}, "source": [ "## Grow the results\n", "\n", "Any Python process with access to the sown batches in ``.xyz-first_run`` (and the function requirements) can grow the results (you could even zip the folder up and send it elsewhere). The process can be run in several ways:\n", "\n", "1. In the ``.xyz-first_run`` folder itself, using e.g.:\n", "\n", "```bash\n", "python -c \"import xyzpy; xyzpy.grow(i)\" # with i = 1 ... 20\n", "```\n", "\n", "2. In the current ('parent') folder. One then has to use a named crop to differentiate, e.g.:\n", "\n", "```bash\n", "python -c \"import xyzpy; crop=xyzpy.Crop(name='first_run'); xyzpy.grow(i, crop=crop)\"\n", "```\n", "\n", "3. Somewhere else. 
Then the parent must be specified too, e.g.:\n", "\n", "```bash\n", "python -c \"import xyzpy; crop=xyzpy.Crop(name='first_run', parent_dir='.../xyzpy/docs/examples'); xyzpy.grow(i, crop=crop)\"\n", "```" ] }, { "cell_type": "markdown", "id": "ab1f95e6-a3ac-4511-99db-63574adebb24", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "To fake this happening we can run {func}`~xyzpy.grow` ourselves (this cell could stand alone):" ] }, { "cell_type": "code", "execution_count": 7, "id": "46ce3a43-5f41-4620-975a-a53d4fc6fe6a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: loaded batch 1 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 0, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 2977.64it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 1 completed.\n", "xyzpy: loaded batch 2 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 0, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 3551.49it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 2 completed.\n", "xyzpy: loaded batch 3 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 1, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 5059.47it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 3 completed.\n", "xyzpy: loaded batch 4 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 1, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 3075.45it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 4 completed.\n", "xyzpy: loaded batch 5 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 2, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 4743.61it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 5 completed.\n", "xyzpy: loaded 
batch 6 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 2, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 4306.27it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 6 completed.\n", "xyzpy: loaded batch 7 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 3, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 4148.67it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 7 completed.\n", "xyzpy: loaded batch 8 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 3, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 5561.26it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 8 completed.\n", "xyzpy: loaded batch 9 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 4, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 5197.40it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 9 completed.\n", "xyzpy: loaded batch 10 of first_run.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "{'a': 4, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 5175.60it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "xyzpy: success - batch 10 completed.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "import xyzpy\n", "\n", "crop = xyzpy.Crop(name=\"first_run\")\n", "for i in range(1, 11):\n", "    xyzpy.grow(i, crop=crop)" ] }, { "cell_type": "markdown", "id": "9e4a6083-15cf-4f77-90ba-85e625ad637d", "metadata": {}, "source": [ "And now we can check the progress:" ] }, { "cell_type": "code", "execution_count": 8, "id": "f3e23aaa-8186-4c3f-95f0-b5e8cc2f89f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "/Users/johnnie/Sync/dev/python/xyzpy/docs/examples/.xyz-first_run\n", 
"--------------------------------------------------------=========\n", "10 / 20 batches of size 5 completed\n", "[########## ] : 50.0%\n", "\n" ] } ], "source": [ "print(c)" ] }, { "cell_type": "markdown", "id": "661af806-7cd5-4473-8bb6-e1561067893b", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "If we were on a batch system we could use {meth}`xyzpy.Crop.grow_cluster` to automatically\n", "submit all missing batches as jobs. It is worth double-checking the script that\n", "is used first though! This is done using {meth}`xyzpy.Crop.gen_cluster_script`:" ] }, { "cell_type": "code", "execution_count": 9, "id": "7219be85-376c-43e3-bb5b-080aa87235b5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#!/bin/bash -l\n", "#$ -S /bin/bash\n", "#$ -N first_run\n", "#$ -l h_rt=0:20:0,mem=1G\n", "#$ -l tmpfs=1G\n", "mkdir -p /Users/johnnie/Scratch/output\n", "#$ -wd /Users/johnnie/Scratch/output\n", "#$ -pe smp None\n", "\n", "#$ -t 1-10\n", "echo 'XYZPY script starting...'\n", "cd /Users/johnnie/Sync/dev/python/xyzpy/docs/examples\n", "export OMP_NUM_THREADS=None\n", "export MKL_NUM_THREADS=None\n", "export OPENBLAS_NUM_THREADS=None\n", "export NUMBA_NUM_THREADS=None\n", "\n", "conda activate dev:py314\n", "read -r -d '' SCRIPT << EOM\n", "#\n", "from xyzpy.gen.cropping import grow, Crop\n", "if __name__ == '__main__':\n", " crop = Crop(name='first_run', parent_dir='/Users/johnnie/Sync/dev/python/xyzpy/docs/examples')\n", " print('Growing:', repr(crop))\n", " grow_kwargs = dict(\n", " num_workers=None,\n", " subprocess=False,\n", " debugging=False,\n", " verbosity_grow=2,\n", " )\n", " batch_ids = (11, 12, 13, 14, 15, 16, 17, 18, 19, 20)\n", " crop.grow(batch_ids[$SGE_TASK_ID - 1], **grow_kwargs)\n", "EOM\n", "/Users/johnnie/Sync/dev/python/.pixi/envs/py314/bin/python -c \"$SCRIPT\"\n", "echo 'XYZPY script finished'\n", "\n" ] } ], "source": [ "print(c.gen_cluster_script(scheduler=\"sge\", minutes=20, 
gigabytes=1))" ] }, { "cell_type": "markdown", "id": "80a0d47f-2ffb-42df-9d09-941f61bafa45", "metadata": {}, "source": [ "The default ``scheduler`` is ``'sge'`` (Sun Grid Engine),\n", "however you can also specify ``'pbs'`` (Portable Batch System)\n", "or ``'slurm'``:" ] }, { "cell_type": "code", "execution_count": 10, "id": "e5ac5eae-8161-4aa1-800d-bdb3be5e3cd9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#!/bin/bash -l\n", "#PBS -N first_run\n", "#PBS -lselect=None:ncpus=None:mem=1gb\n", "#PBS -lwalltime=00:20:00\n", "\n", "#PBS -J 1-10\n", "echo 'XYZPY script starting...'\n", "cd /Users/johnnie/Sync/dev/python/xyzpy/docs/examples\n", "export OMP_NUM_THREADS=None\n", "export MKL_NUM_THREADS=None\n", "export OPENBLAS_NUM_THREADS=None\n", "export NUMBA_NUM_THREADS=None\n", "\n", "conda activate dev:py314\n", "read -r -d '' SCRIPT << EOM\n", "#\n", "from xyzpy.gen.cropping import grow, Crop\n", "if __name__ == '__main__':\n", " crop = Crop(name='first_run', parent_dir='/Users/johnnie/Sync/dev/python/xyzpy/docs/examples')\n", " print('Growing:', repr(crop))\n", " grow_kwargs = dict(\n", " num_workers=None,\n", " subprocess=False,\n", " debugging=False,\n", " verbosity_grow=2,\n", " )\n", " batch_ids = (11, 12, 13, 14, 15, 16, 17, 18, 19, 20)\n", " crop.grow(batch_ids[$PBS_ARRAY_INDEX - 1], **grow_kwargs)\n", "EOM\n", "/Users/johnnie/Sync/dev/python/.pixi/envs/py314/bin/python -c \"$SCRIPT\"\n", "echo 'XYZPY script finished'\n", "\n" ] } ], "source": [ "print(c.gen_cluster_script(scheduler=\"pbs\", minutes=20, gigabytes=1))" ] }, { "cell_type": "code", "execution_count": 11, "id": "af5e4126-463a-435a-bd0b-d237bd18c12d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#!/bin/bash -l\n", "#SBATCH --job-name=first_run\n", "#SBATCH --time=00:20:00\n", "#SBATCH --mem=1G\n", "#SBATCH --array=1-10\n", "echo 'XYZPY script starting...'\n", "cd /Users/johnnie/Sync/dev/python/xyzpy/docs/examples\n", 
"export OMP_NUM_THREADS=None\n", "export MKL_NUM_THREADS=None\n", "export OPENBLAS_NUM_THREADS=None\n", "export NUMBA_NUM_THREADS=None\n", "\n", "conda activate dev:py314\n", "read -r -d '' SCRIPT << EOM\n", "#\n", "from xyzpy.gen.cropping import grow, Crop\n", "if __name__ == '__main__':\n", " crop = Crop(name='first_run', parent_dir='/Users/johnnie/Sync/dev/python/xyzpy/docs/examples')\n", " print('Growing:', repr(crop))\n", " grow_kwargs = dict(\n", " num_workers=None,\n", " subprocess=False,\n", " debugging=False,\n", " verbosity_grow=2,\n", " )\n", " batch_ids = (11, 12, 13, 14, 15, 16, 17, 18, 19, 20)\n", " crop.grow(batch_ids[$SLURM_ARRAY_TASK_ID - 1], **grow_kwargs)\n", "EOM\n", "/Users/johnnie/Sync/dev/python/.pixi/envs/py314/bin/python -c \"$SCRIPT\"\n", "echo 'XYZPY script finished'\n", "\n" ] } ], "source": [ "print(c.gen_cluster_script(scheduler=\"slurm\", minutes=20, gigabytes=1))" ] }, { "cell_type": "markdown", "id": "5ec77731-2fa7-4626-bdfa-7648812b436b", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "If you are just using the ``Crop`` as a persistence mechanism,\n", "then {meth}`xyzpy.Crop.grow` or {meth}`xyzpy.Crop.grow_missing`\n", "will process the batches in the current process:" ] }, { "cell_type": "code", "execution_count": 12, "id": "5dceeaaf-60a7-4e33-8dae-db3e5e7c16be", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Grow: 100%|##########| 10/10 [00:00<00:00, 11.12it/s]\n" ] } ], "source": [ "c.grow_missing(parallel=True) # this accepts combo_runner kwargs" ] }, { "cell_type": "code", "execution_count": 13, "id": "0a216431-ef8c-431e-9ec6-965eed7ff1d9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "/Users/johnnie/Sync/dev/python/xyzpy/docs/examples/.xyz-first_run\n", "--------------------------------------------------------=========\n", "20 / 20 batches of size 5 completed\n", "[####################] : 100.0%\n", "\n" ] } ], "source": [ 
"print(c)" ] }, { "cell_type": "markdown", "id": "3b7964dd-0bf2-43fc-bb60-026c7f4da2fa", "metadata": { "raw_mimetype": "text/restructuredtext", "tags": [] }, "source": [ ":::{hint}\n", "If different function calls might take different amounts of time based on their arguments,\n", "you can supply ``shuffle=True`` to {meth}`xyzpy.Crop.sow_combos`. Each batch will then\n", "be a random selection of cases, which should even out the effort each takes as long as\n", "``batchsize`` is not too small.\n", ":::" ] }, { "cell_type": "markdown", "id": "1ac81b0c-8489-4869-9f38-c57f529cf7c8", "metadata": {}, "source": [ "## Reap the results\n", "\n", "The final step is to **'reap'** the results from disk. Because the crop was instantiated from a ``Harvester``, that harvester will be automatically used to collect the resulting dataset and sync it with the on-disk dataset:" ] }, { "cell_type": "code", "execution_count": 14, "id": "d931ad50-76f6-49c9-8a94-6390b0fdba60", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Reap: 100%|##########| 100/100 [00:00<00:00, 53274.53it/s]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
<xarray.Dataset> Size: 2kB\n",
       "Dimensions:  (a: 10, b: 10)\n",
       "Coordinates:\n",
       "  * a        (a) int64 80B 0 1 2 3 4 5 6 7 8 9\n",
       "  * b        (b) int64 80B 0 1 2 3 4 5 6 7 8 9\n",
       "Data variables:\n",
       "    sum      (a, b) int64 800B 0 1 2 3 4 5 6 7 8 ... 10 11 12 13 14 15 16 17 18\n",
       "    diff     (a, b) int64 800B 0 -1 -2 -3 -4 -5 -6 -7 -8 ... 8 7 6 5 4 3 2 1 0
" ], "text/plain": [ " Size: 2kB\n", "Dimensions: (a: 10, b: 10)\n", "Coordinates:\n", " * a (a) int64 80B 0 1 2 3 4 5 6 7 8 9\n", " * b (b) int64 80B 0 1 2 3 4 5 6 7 8 9\n", "Data variables:\n", " sum (a, b) int64 800B 0 1 2 3 4 5 6 7 8 ... 10 11 12 13 14 15 16 17 18\n", " diff (a, b) int64 800B 0 -1 -2 -3 -4 -5 -6 -7 -8 ... 8 7 6 5 4 3 2 1 0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c.reap()" ] }, { "cell_type": "markdown", "id": "03670e7b-a431-4bed-9d33-cbd3caa778ef", "metadata": {}, "source": [ ":::{hint}\n", "If the ``Crop`` is incomplete but has some results, you can call `crop.reap(allow_incomplete=True)` to harvest the existing data.\n", ":::\n", "\n", ":::{hint}\n", "You can supply other kwargs related to harvesting such as ``overwrite=True``, which is useful when you want to replace existing data with newer runs without starting over.\n", ":::\n", "\n", "The dataset `foo_data.h5` should be on disk, and the crop folder cleaned up:" ] }, { "cell_type": "code", "execution_count": 15, "id": "bd9effac-280d-4383-ae3d-c7ef7688efa0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[34m.\u001b[m\u001b[m/ dask distributed example.ipynb\n", "\u001b[34m..\u001b[m\u001b[m/ farming example.ipynb\n", "basic output example.ipynb foo_data.h5\n", "complex output example.ipynb visualize linear algebra.ipynb\n", "crop example.ipynb\n" ] } ], "source": [ "!ls -a" ] }, { "cell_type": "markdown", "id": "c0ad3d34-7f3b-4d7d-afe3-22e89b66327f", "metadata": {}, "source": [ "And we can inspect the results:" ] }, { "cell_type": "code", "execution_count": 16, "id": "a0bc495a-ce4f-4af6-93c9-be43fbf22a1e", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "" ], "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "h.full_ds.xyz.plot(x=\"a\", y=\"b\", z=\"diff\");" ] }, { "cell_type": "markdown", "id": "367e019f-f48f-4879-9365-d6d827ff65f5", "metadata": {}, "source": [ "Many crops can be created from the harvester at once, and when they are reaped, the results should be seamlessly combined into the on-disk dataset." ] }, { "cell_type": "code", "execution_count": 17, "id": "9d93a4f6-94f3-4603-aa95-f3f913d73d97", "metadata": {}, "outputs": [], "source": [ "# for now clean up\n", "h.delete_ds()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3" } }, "nbformat": 4, "nbformat_minor": 4 }